Abstract

We consider a linear stochastic bandit problem where the dimension $K$ of the
unknown parameter $\theta$ is larger than the sampling budget $n$. Since usual
linear bandit algorithms have a regret of order $O(K\sqrt{n})$, it is in
general impossible to obtain a sub-linear regret without further assumptions.
In this paper we assume that $\theta$ is $S$-sparse, i.e. has at most $S$
non-zero components, and that the set of arms is the unit ball for the
$\|\cdot\|_2$ norm. We combine ideas from Compressed Sensing and Bandit Theory
to derive an algorithm with a regret bound in $O(S\sqrt{n})$. We detail an
application to the problem of optimizing a function that depends on many
variables, only a small number of which (initially unknown) are relevant.



Introduction 

We consider a linear stochastic bandit problem in high dimension $K$. At each
round $t$, from 1 to $n$, the player chooses an arm $x_t$ in a fixed set of
arms and receives a reward $r_t = \langle x_t, \theta + \eta_t \rangle$, where
$\theta \in \mathbb{R}^K$ is an unknown parameter and $\eta_t$ is a noise term.
Note that $r_t$ is a (noisy) projection of $\theta$ on $x_t$. The goal of the
learner is to maximize the sum of rewards.

We are interested in cases where the number of rounds is much smaller than the
dimension of the parameter, i.e. $n \ll K$. This is new in the bandit
literature but useful in practice, as illustrated by the problem of gradient
ascent for a high-dimensional function, described later.

In this setting it is in general impossible to estimate $\theta$ accurately
(since there is not even one sample per dimension). It is thus necessary to
restrict the setting, and the assumption we consider here is that $\theta$ is
$S$-sparse (i.e., at most $S$ components of $\theta$ are non-zero). We also
assume that the set of arms to which $x_t$ belongs is the unit ball with
respect to the $\|\cdot\|_2$ norm induced by the inner product.

Bandit Theory meets Compressed Sensing 

This problem poses the fundamental question at the heart of bandit theory,
namely the exploration¹ versus exploitation² dilemma. Usually, when the
dimension $K$ of the space is smaller than the budget $n$, it is possible to
project the parameter $\theta$ at least once on each direction of a basis
(e.g. the canonical basis), which enables efficient exploration. However, in
our setting where $K \gg n$, this is no longer possible, and we use the
sparsity assumption on $\theta$ to build a clever exploration strategy.



Compressed Sensing (see e.g. Candès and Tao, 2007; Chen et al., 1999;
Blumensath and Davies, 2003) provides us with an exploration technique that
makes it possible to estimate $\theta$, or more simply its support, from few
measurements, provided that $\theta$ is sparse. The idea is to project
$\theta$ on random (isotropic) directions $x_t$ such that each reward sample
provides equal information about all coordinates of $\theta$. This is the
reason why we choose the set of arms to be the unit ball. Then, using a
regularization method (Hard Thresholding, Lasso, Dantzig selector...), one can
recover the support of the parameter. Note that although Compressed Sensing
makes it possible to build a good estimate of $\theta$, it is not designed for
the purpose of maximizing the sum of rewards. Indeed, this exploration
strategy is uniform and non-adaptive (i.e., the sampling direction $x_t$ at
time $t$ does not depend on the previously observed rewards
$r_1, \ldots, r_{t-1}$).



On the contrary, Linear Bandit Theory (see e.g. Rusmevichientong and
Tsitsiklis (2008); Dani et al. (2008); Filippi et al. (2010) and the recent
work by Abbasi-Yadkori et al.) addresses this issue of maximizing the sum of
rewards by efficiently balancing between exploration and exploitation. The
main idea of our algorithm is to use Compressed Sensing to estimate the
(small) support of $\theta$, and to combine this with a linear bandit
algorithm with a set of arms restricted to the estimated support of $\theta$.

1 Exploring all directions makes it possible to build a good estimate of all
the components of $\theta$ in order to deduce which arms are the best.

2 Pulling the empirical best arms in order to maximize the sum of rewards.

Our contributions are the following: 

• We provide an algorithm, called SL-UCB (for Sparse Linear Upper Confidence
Bound), that mixes ideas of Compressed Sensing and Bandit Theory, and we
provide a regret bound³ of order $O(S\sqrt{n})$.

• We detail an application of this setting to the problem of gradient ascent
of a high-dimensional function that depends on a small number of relevant
variables only (i.e., its gradient is sparse). We explain why the setting of
gradient ascent can be seen as a bandit problem and report numerical
experiments showing the efficiency of SL-UCB for this high-dimensional
optimization problem.

The topic of sparse linear bandits is also considered in the paper
(Abbasi-Yadkori et al., 2012) published simultaneously. Their regret bound
scales as $O(\sqrt{KSn})$ (whereas ours does not show any dependence on $K$),
but they do not make the assumption that the set of arms is the Euclidean ball
and their noise model is different from ours.

In Section [1] we describe our setting and recall a result 
on linear bandits. Then in Section [2] we describe the 
SL-UCB algorithm and provide the main result. In 
Section [3] we detail the application to gradient ascent 
and provide numerical experiments. 

1 Setting and a useful existing result 

1.1 Description of the problem 

We consider a linear bandit problem in dimension $K$. An algorithm (or
strategy) Alg is given a budget of $n$ pulls. At each round $1 \le t \le n$ it
selects an arm $x_t$ in the set of arms $B_K$, which is the unit ball for the
$\|\cdot\|_2$ norm induced by the inner product. It then receives a reward

$$r_t = \langle x_t, \theta + \eta_t \rangle,$$

where $\eta_t \in \mathbb{R}^K$ is an i.i.d. white noise⁴ that is independent
from the past actions, i.e. from $(x_{t'})_{t' \le t}$, and
$\theta \in \mathbb{R}^K$ is an unknown parameter.

3 We define the notion of regret in Section 1.
4 This means that $\mathbb{E}_{\eta_t}(\eta_{k,t}) = 0$ for every $(k,t)$, that
the $(\eta_{k,t})_k$ are independent and that the $(\eta_{k,t})_t$ are i.i.d.

We define the performance of algorithm Alg as

$$L_n(\mathrm{Alg}) = \sum_{t=1}^{n} \langle \theta, x_t \rangle. \qquad (1)$$

Note that $L_n(\mathrm{Alg})$ differs from the sum of rewards
$\sum_{t=1}^{n} r_t$ but is close to it (up to an $O(\sqrt{n})$ term) with high
probability. Indeed, $\sum_{t=1}^{n} \langle \eta_t, x_t \rangle$ is a
martingale; thus, if we assume that the noise $\eta_{k,t}$ is bounded by
$\frac{1}{2}\sigma_k$ (note that this can be extended to sub-Gaussian noise),
Azuma's inequality implies that with probability $1 - \delta$ we have

$$\sum_{t=1}^{n} r_t = L_n(\mathrm{Alg}) + \sum_{t=1}^{n} \langle \eta_t, x_t \rangle
\le L_n(\mathrm{Alg}) + \sqrt{2\log(1/\delta)}\,\|\sigma\|_2 \sqrt{n}.$$

If the parameter $\theta$ were known, the best strategy Alg* would always pick
$x^* = \arg\max_{x \in B_K} \langle \theta, x \rangle$ and obtain the
performance

$$L_n(\mathrm{Alg}^*) = n\|\theta\|_2. \qquad (2)$$

We define the regret of an algorithm Alg with respect to this optimal strategy
as

$$R_n(\mathrm{Alg}) = L_n(\mathrm{Alg}^*) - L_n(\mathrm{Alg}). \qquad (3)$$



We consider the class of algorithms that do not know the parameter $\theta$.
Our objective is to find an adaptive strategy Alg (i.e. one that makes use of
the history $\{(x_1, r_1), \ldots, (x_{t-1}, r_{t-1})\}$ at time $t$ to choose
the next arm $x_t$) with the smallest possible regret.

For a given $t$, we write $X_t = (x_1; \ldots; x_t)$ for the matrix in
$\mathbb{R}^{K \times t}$ of all chosen arms, and
$R_t = (r_1, \ldots, r_t)^T$ for the vector in $\mathbb{R}^t$ of all rewards,
up to time $t$.

In this paper, we consider the case where the dimension $K$ is much larger
than the budget, i.e., $n \ll K$. As already mentioned, in general it is
impossible to estimate the parameter accurately and thus achieve a sub-linear
regret. This is the reason why we make the assumption that $\theta$ is
$S$-sparse with $S < n$.
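The setting of this subsection can be made concrete with a small simulation. The sketch below is illustrative only: the helper names (`make_sparse_theta`, `reward`, `regret`), the uniform noise model and the values of $K$, $S$, $n$ are assumptions chosen for the example, not part of the paper. It evaluates the regret of Equation (3) for the oracle arm $\theta/\|\theta\|_2$ and for the trivial strategy that always plays the zero arm.

```python
import math
import random


def make_sparse_theta(K, S, rng):
    """Hypothetical helper: an S-sparse parameter of R^K with entries in {-1, 0, +1}."""
    theta = [0.0] * K
    for k in rng.sample(range(K), S):
        theta[k] = rng.choice([-1.0, 1.0])
    return theta


def reward(x, theta, sigma, rng):
    """r_t = <x_t, theta + eta_t>, each eta_{k,t} uniform on [-sigma/2, sigma/2]."""
    return sum(xk * (tk + rng.uniform(-sigma / 2.0, sigma / 2.0))
               for xk, tk in zip(x, theta))


def regret(arms, theta):
    """R_n = n * ||theta||_2 - sum_t <theta, x_t>, as in Equations (1)-(3)."""
    norm = math.sqrt(sum(tk * tk for tk in theta))
    performance = sum(sum(tk * xk for tk, xk in zip(theta, x)) for x in arms)
    return len(arms) * norm - performance


rng = random.Random(0)
K, S, n = 200, 5, 50                      # assumed toy sizes with n << K
theta = make_sparse_theta(K, S, rng)
theta_norm = math.sqrt(sum(tk * tk for tk in theta))

best_arm = [tk / theta_norm for tk in theta]    # maximizer of <theta, x> on the unit ball
oracle_regret = regret([best_arm] * n, theta)   # zero up to rounding
idle_regret = regret([[0.0] * K] * n, theta)    # n * ||theta||_2
sample_reward = reward(best_arm, theta, 0.1, rng)
```

Playing the oracle arm gives zero regret, while never moving from the origin costs $n\|\theta\|_2$, which is the linear regret that an algorithm must improve upon.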

1.2 A useful algorithm for Linear Bandits 

We now recall the algorithm ConfidenceBall$_2$ (abbreviated as CB$_2$)
introduced in Dani et al. (2008) and mention the corresponding regret bound.
CB$_2$ will later be used in the SL-UCB algorithm described in the next
section, where it is applied to the subspace restricted to the estimated
support of the parameter.

This algorithm is designed for stochastic linear bandits in dimension $d$
(i.e. the parameter $\theta$ is in $\mathbb{R}^d$) where $d$ is smaller than
the budget $n$.

The pseudo-code of the algorithm is presented in Figure 1. The idea is to
build a confidence ellipsoid for the parameter $\theta$, namely
$B_t = \{\nu : \|\nu - \hat\theta_t\|_{2,A_t} \le \sqrt{\beta_t}\}$
Input: $B_d$, $\delta$
Initialization:
    $A_1 = I_d$, $\hat\theta_1 = 0$, $\beta_t = 128\, d\, (\log(n^2/\delta))^2$.
for $t = 1, \ldots, n$ do
    Define $B_t = \{\nu : \|\nu - \hat\theta_t\|_{2,A_t} \le \sqrt{\beta_t}\}$.
    Play $x_t = \arg\max_{x \in B_d} \max_{\nu \in B_t} \langle \nu, x \rangle$.
    Observe $r_t = \langle x_t, \theta + \eta_t \rangle$.
    Set $A_{t+1} = A_t + x_t x_t^T$, $\hat\theta_{t+1} = A_{t+1}^{-1} X_t R_t$.
end for

Figure 1: Algorithm ConfidenceBall$_2$ (CB$_2$) adapted for an action set of
the form $B_d$ (Left), and illustration of the maximization problem that
defines $x_t$ (Right).



where $\|u\|_{2,A} = u^T A u$ and $\hat\theta_t = A_t^{-1} X_{t-1} R_{t-1}$, and
to pull the arm with the largest inner product with a vector in $B_t$, i.e.
the arm $x_t = \arg\max_{x \in B_d} \max_{\nu \in B_t} \langle \nu, x \rangle$.
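The statistics maintained by CB$_2$ can be sketched in a few lines. The code below only illustrates the updates $A_{t+1} = A_t + x_t x_t^T$ and $\hat\theta_{t+1} = A_{t+1}^{-1} X_t R_t$; it does not implement the arm selection step. Maintaining $A_t^{-1}$ through the Sherman-Morrison rank-one formula is a choice made here for convenience, not something taken from Dani et al. (2008), and the function names are hypothetical.

```python
def sherman_morrison_update(A_inv, x):
    """Return (A + x x^T)^{-1} from A^{-1}, for a symmetric positive definite A."""
    d = len(x)
    Ax = [sum(A_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(x[i] * Ax[i] for i in range(d))
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)]
            for i in range(d)]


def cb2_estimate(arms, rewards):
    """Estimate A^{-1} X R after all pulls, starting from A_1 = I_d."""
    d = len(arms[0])
    A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    xr = [0.0] * d                       # running X_t R_t = sum_i x_i r_i
    for x, r in zip(arms, rewards):
        A_inv = sherman_morrison_update(A_inv, x)
        xr = [xr[i] + x[i] * r for i in range(d)]
    return [sum(A_inv[i][j] * xr[j] for j in range(d)) for i in range(d)]


# Toy check: noiseless rewards, each canonical direction pulled 99 times.
theta = [0.5, -0.25, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
arms = basis * 99
rewards = [sum(xi * ti for xi, ti in zip(x, theta)) for x in arms]
estimate = cb2_estimate(arms, rewards)
```

In this toy run the matrix is $A = 100\, I_3$ and $X R = 99\,\theta$, so the estimate equals $0.99\,\theta$: the identity initialization acts as a small shrinkage that vanishes as the number of pulls grows.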

Note that this algorithm is intended for general shapes of the set of arms. We
can thus apply it in the particular case where the set of arms is the unit
ball $B_d$ for the $\|\cdot\|_2$ norm in $\mathbb{R}^d$. This specific set of
arms is simpler for two reasons. First, it is easy to define a span of the set
of arms, since we can simply choose the canonical basis of $\mathbb{R}^d$.
Second, the choice of $x_t$ is simply the point of the confidence ellipsoid
$B_t$ with the largest norm. Note also that we present here a simplified
variant where the time horizon $n$ is known: the original version of the
algorithm is anytime. We now recall Theorem 2 of Dani et al. (2008).

Theorem 1 (ConfidenceBall$_2$) Assume that $(\eta_t)$ is an i.i.d. white
noise, independent of the $(x_{t'})_{t' \le t}$, and that for all
$k \in \{1, \ldots, d\}$, $\exists \sigma_k$ such that for all $t$,
$|\eta_{t,k}| \le \frac{1}{2}\sigma_k$. For large enough $n$, we have with
probability $1 - \delta$ the following bound for the regret of
ConfidenceBall$_2(B_d, \delta)$:

$$R_n(\mathrm{Alg}_{CB_2}) \le 64\, d\, (\|\theta\|_2 + \|\sigma\|_2)(\log(n^2/\delta))^2 \sqrt{n}.$$

2 The algorithm SL-UCB 

Now we come back to our setting where $n \ll K$. We present here an algorithm,
called Sparse Linear Upper Confidence Bound (SL-UCB).

2.1 Presentation of the algorithm 

SL-UCB is divided into two main parts: (i) a first non-adaptive phase that
uses an idea from Compressed Sensing, referred to as the support exploration
phase, where we project $\theta$ on isotropic random vectors in order to
select the arms that belong to what we call the active set $\mathcal{A}$, and
(ii) a second phase that we call the restricted linear bandit phase, where we
apply a linear bandit algorithm to the active set $\mathcal{A}$ in order to
balance exploration and exploitation and further minimize the regret. Note
that the length of the support exploration phase is problem dependent.

This algorithm takes as parameters $\bar\sigma_2$ and $\bar\theta_2$, which are
upper bounds on $\|\sigma\|_2$ and $\|\theta\|_2$ respectively, and $\delta$,
which is a (small) probability.

First, we define an exploring set as

$$\mathrm{Exploring} = \frac{1}{\sqrt{K}}\{-1, +1\}^K. \qquad (4)$$

Note that $\mathrm{Exploring} \subset B_K$. We sample this set uniformly during
the support exploration phase. This gives us some insight about the
directions on which the parameter $\theta$ is sparse, using very simple
concentration tools⁵: at the end of this phase, the algorithm selects a set of
coordinates $\mathcal{A}$, named the active set, which are the directions
where $\theta$ is likely to be non-zero. The algorithm automatically adapts
the length of this phase, and no knowledge of $\|\theta\|_2$ is required. The
Support Exploration Phase ends at the first time $t$ such that (i)
$\max_k |\hat\theta_{k,t}| - \frac{2b}{\sqrt{t}} \ge 0$ for a well-defined
constant $b$ and (ii)
$t \ge \frac{\sqrt{n}}{\max_k |\hat\theta_{k,t}| - \frac{b}{\sqrt{t}}}$.

We then exploit the information collected in the first phase, i.e. the active
set $\mathcal{A}$, by playing a linear bandit algorithm on the intersection of
the unit ball $B_K$ and the vector subspace spanned by the active set
$\mathcal{A}$, i.e. $\mathrm{Vec}(\mathcal{A})$. Here we choose to use the
algorithm CB$_2$ described in (Dani et al., 2008). See Subsection 1.2 for an
adaptation of this algorithm to our specific case: the set of arms is indeed
the unit ball for the $\|\cdot\|_2$ norm in the vector subspace
$\mathrm{Vec}(\mathcal{A})$.

5 Note that this idea is very similar to the one of Compressed Sensing.

The algorithm is described in Figure 2.

Input: parameters $\bar\sigma_2$, $\bar\theta_2$, $\delta$.
Initialize: Set $b = (\bar\theta_2 + \bar\sigma_2)\sqrt{2\log(2K/\delta)}$.
Pull randomly an arm $x_1$ in Exploring (defined in Equation 4) and observe $r_1$.
Support Exploration Phase:
while (i) $\max_k |\hat\theta_{k,t}| - \frac{2b}{\sqrt{t}} < 0$ or (ii) $t < \frac{\sqrt{n}}{\max_k |\hat\theta_{k,t}| - \frac{b}{\sqrt{t}}}$ do
    Pull randomly an arm $x_t$ in Exploring (defined in Equation 4) and observe $r_t$.
    Compute $\hat\theta_t$ using Equation 5.
    Set $t \leftarrow t + 1$.
end while
Call $T$ the length of the Support Exploration Phase.
Set $\mathcal{A} = \{k : |\hat\theta_{k,T}| \ge \frac{2b}{\sqrt{T}}\}$.
Restricted Linear Bandit Phase:
For $t = T + 1, \ldots, n$, apply CB$_2(B_K \cap \mathrm{Vec}(\mathcal{A}), \delta)$ and
collect the rewards $r_t$.

Figure 2: The pseudo-code of the SL-UCB algorithm.
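The first phase of SL-UCB can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the stopping rule reads (i) $\max_k |\hat\theta_{k,t}| \ge 2b/\sqrt{t}$ and (ii) $t \ge \sqrt{n} / (\max_k |\hat\theta_{k,t}| - b/\sqrt{t})$, that the active set keeps the coordinates with $|\hat\theta_{k,T}| \ge 2b/\sqrt{T}$, and that the noise is uniform on $[-\sigma/2, \sigma/2]$. The function name `support_exploration` and all numerical values are hypothetical.

```python
import math
import random


def support_exploration(theta, n, b, sigma, rng):
    """Sketch of the support exploration phase; returns (T, active_set)."""
    K = len(theta)
    scale = 1.0 / math.sqrt(K)
    sums = [0.0] * K                     # running sums  sum_i x_{k,i} r_i
    estimate = [0.0] * K
    t = 0
    while t < n:
        t += 1
        # Uniform draw from the exploring set K^{-1/2} {-1, +1}^K.
        x = [scale * rng.choice((-1.0, 1.0)) for _ in range(K)]
        # Reward <x_t, theta + eta_t> with bounded noise |eta_{k,t}| <= sigma/2.
        r = sum(xk * (tk + rng.uniform(-sigma / 2.0, sigma / 2.0))
                for xk, tk in zip(x, theta))
        sums = [s + xk * r for s, xk in zip(sums, x)]
        estimate = [K * s / t for s in sums]        # estimator (K/t) X_t R_t
        top = max(abs(e) for e in estimate)
        margin = top - b / math.sqrt(t)
        crit_i = top - 2.0 * b / math.sqrt(t) >= 0.0
        crit_ii = margin > 0.0 and t >= math.sqrt(n) / margin
        if crit_i and crit_ii:                      # both criteria met: stop exploring
            break
    threshold = 2.0 * b / math.sqrt(t)
    active = [k for k in range(K) if abs(estimate[k]) >= threshold]
    return t, active


rng = random.Random(1)
K, n, delta = 50, 400, 0.1                          # assumed toy values
theta = [5.0] + [0.0] * (K - 1)                     # 1-sparse parameter
b = 5.0 * math.sqrt(2.0 * math.log(2.0 * K / delta))   # noiseless toy run: sigma = 0
T, active = support_exploration(theta, n, b, 0.0, rng)
```

In this noiseless toy run the estimate of the single relevant coordinate is exact, so the phase stops as soon as $2b/\sqrt{t}$ drops below $|\theta_1| = 5$, and only that coordinate passes the threshold.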



Note that the algorithm computes $\hat\theta_{k,t}$ using

$$\hat\theta_{k,t} = \frac{K}{t}\Big(\sum_{i=1}^{t} x_{k,i}\, r_i\Big) = \Big(\frac{K}{t} X_t R_t\Big)_k. \qquad (5)$$

2.2 Main Result

We first state an assumption on the noise.

Assumption 1 $(\eta_{k,t})_{k,t}$ is an i.i.d. white noise and
$\exists \sigma_k$ s.t. $|\eta_{k,t}| \le \frac{1}{2}\sigma_k$.

Note that this assumption is made for simplicity and that it could easily be
generalized to, for instance, sub-Gaussian noise. Under this assumption, we
have the following bound on the regret.

Theorem 2 Under Assumption 1, if we choose $\bar\sigma_2 \ge \|\sigma\|_2$ and
$\bar\theta_2 \ge \|\theta\|_2$, the regret of SL-UCB is bounded, with
probability at least $1 - 5\delta$, as

$$R_n(\mathrm{Alg}_{SL\text{-}UCB}) \le 118\,(\bar\theta_2 + \bar\sigma_2)^2 \log(2K/\delta)\, S \sqrt{n}.$$

The proof of this result is reported in Section 4.

The algorithm SL-UCB first uses an idea of Compressed Sensing: it explores by
performing random projections and builds an estimate of $\theta$. It then
selects the support as soon as the uncertainty is small enough, and applies
CB$_2$ to the selected support. The particularity of this algorithm is that
the length of the support exploration phase adjusts to the difficulty of
finding the support: the length of this phase is of order
$O(\sqrt{n}/\|\theta\|_2)$. More precisely, the smaller $\|\theta\|_2$, the
more difficult the problem (since it is difficult to find the largest
components of the support), and the longer the support exploration phase. But
note that the regret does not deteriorate for small values of $\|\theta\|_2$,
since in such a case the loss at each step is small too.

An interesting feature of SL-UCB is that it does not require the knowledge of
the sparsity $S$ of the parameter.

3 The gradient ascent as a bandit problem

The aim of this section is to propose a gradient optimization technique to
maximize a function $f : \mathbb{R}^K \to \mathbb{R}$ when the dimension $K$ is
large compared to the number of gradient steps $n$, i.e. $n \ll K$. We assume
that the function $f$ depends on a small number of relevant variables: this
corresponds to the assumption that the gradient of $f$ is sparse.

We consider a stochastic gradient ascent (see for instance the book of
Bertsekas (1999) for an exhaustive survey on gradient methods), where one
estimates the gradient of $f$ at a sequence of points and moves in the
direction of the gradient estimate during $n$ iterations.

3.1 Formalization

The objective is to apply gradient ascent to a differentiable function $f$,
assuming that we are allowed to query this function $n$ times only. We write
$u_t$ for the $t$-th point where we sample $f$, and choose it such that
$\|u_{t+1} - u_t\|_2 = \epsilon$, where $\epsilon$ is the gradient step.

Note that by the mean value theorem

$$f(u_n) - f(u_0) = \sum_{t=1}^{n} f(u_t) - f(u_{t-1})
= \sum_{t=1}^{n} \langle u_t - u_{t-1}, \nabla f(w_t) \rangle,$$

where $w_t$ is an appropriate barycenter of $u_t$ and $u_{t-1}$.

We can thus model the problem of gradient ascent by a linear bandit problem
where the reward is what we gain/lose by moving from point $u_{t-1}$ to point
$u_t$, i.e. $f(u_t) - f(u_{t-1})$. More precisely, rewriting this problem with
previous notations, we have $\theta + \eta_t = \nabla f(w_t)$⁶ and
$x_t = u_t - u_{t-1}$. We illustrate this model in Figure 3.

6 Note that in order for the model in Section 1 to hold, we need to relax the
assumption that $\eta$ is i.i.d.

If we assume that the function $f$ is (locally) linear and that there are some
i.i.d. measurement errors, we are exactly in the setting of Section 1. The
objective of minimizing the regret, i.e.,

$$R_n(\mathrm{Alg}) = \max_{x \in B_2(u_0, n\epsilon)} f(x) - f(u_n),$$

thus corresponds to the problem of maximizing $f(u_n)$, the $n$-th evaluation
of $f$. Thus the regret corresponds to the evaluation of $f$ at the $n$-th
step compared to an ideal gradient ascent (that assumes that the true gradient
is known and followed for $n$ steps). Applying the SL-UCB algorithm implies
that the regret is in $O(S\epsilon\sqrt{n})$.

Remark on the noise: Assumption 1, which states that the noise added to the
function is of the form $\langle u_t - u_{t-1}, \eta_t \rangle$, is especially
suitable for gradient ascent because it corresponds to the cases where the
noise is an approximation error and depends on the gradient step.

Remark on the linearity assumption: Matching the stochastic bandit model in
Section 1 to the problem of gradient ascent corresponds to assuming that the
function is (locally) linear in a neighborhood of $u_0$, and that we have in
this neighborhood
$f(u_{t+1}) - f(u_t) = \langle u_{t+1} - u_t, \nabla f(u_0) + \eta_{t+1} \rangle$,
where the noise $\eta_{t+1}$ is i.i.d. This setting is somewhat restrictive:
we adopted it in order to offer a first, simple solution for the problem. When
the function is not linear, one should also consider the additional
approximation error.
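The correspondence between gradient steps and bandit rewards can be checked numerically. The sketch below is illustrative only: it assumes a quadratic test function that depends on its first 10 coordinates, random unit-norm moves of length `eps`, and the midpoint of $u_{t-1}$ and $u_t$ as the barycenter $w_t$ (for a quadratic function the midpoint makes the identity exact). The names `f`, `grad_f` and `inner` and the values of `K`, `n`, `eps` are assumptions made for the example.

```python
import random


def f(u):
    """Hypothetical test function: quadratic in its first 10 coordinates only."""
    return sum(-20.0 * (uk - 25.0) ** 2 for uk in u[:10])


def grad_f(u):
    """Gradient of f; it is sparse (zero outside the first 10 coordinates)."""
    return [-40.0 * (uk - 25.0) if k < 10 else 0.0 for k, uk in enumerate(u)]


def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


rng = random.Random(0)
K, n, eps = 30, 20, 0.5
points = [[0.0] * K]
for _ in range(n):
    step = [rng.gauss(0.0, 1.0) for _ in range(K)]
    norm = sum(s * s for s in step) ** 0.5
    points.append([p + eps * s / norm for p, s in zip(points[-1], step)])

rewards, bandit_rewards = [], []
for prev, cur in zip(points, points[1:]):
    x_t = [c - p for c, p in zip(cur, prev)]           # arm: the move u_t - u_{t-1}
    w_t = [(c + p) / 2.0 for c, p in zip(cur, prev)]   # midpoint barycenter
    rewards.append(f(cur) - f(prev))                   # observed gain of the move
    bandit_rewards.append(inner(x_t, grad_f(w_t)))     # <x_t, grad f(w_t)>

total_gain = f(points[-1]) - f(points[0])
```

Each observed gain coincides with the inner product of the move and the gradient at the barycenter, and the gains telescope to $f(u_n) - f(u_0)$, which is why maximizing the sum of rewards amounts to maximizing the final value of $f$.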

3.2 Numerical experiment 

In order to illustrate the mechanism of our algorithm, we apply SL-UCB to a
quadratic function in dimension 100 where only two dimensions are
informative. Figure 4 shows with grey levels the projection of the function
onto these two informative directions and a trajectory followed by $n = 50$
steps of gradient ascent. The beginning of the trajectory shows an erratic
behavior (see the zoom) due to the initial support exploration phase (the
projections of the gradient steps onto the relevant directions are small and
random). However, the algorithm quickly selects the right support of the
gradient, and the restricted linear bandit phase then makes it possible to
follow the gradient very efficiently along the two relevant directions.

We now want to illustrate the performance of SL-UCB on more complex problems.
We fix the number of pulls to $n = 100$, and we try different values of $K$,
in order to produce results for different values of the ratio $K/n$. The
larger this ratio, the more difficult the problem. We choose a quadratic
function that is not constant in $S = 10$ directions⁷.

We compare our algorithm SL-UCB to two strategies: the "oracle" gradient
strategy (OGS), i.e. a gradient algorithm with access to the full gradient of
the function⁸, and the random best direction (BRD) strategy (i.e., at a given
point, it chooses a random direction, observes the value of the function one
step further in this direction, and moves to that point if the value of the
function at this point is larger than its value at the previous point). In
Figure 5 we report the difference between the value at the final point of the
algorithm and the value at the beginning.

7 We keep the same function for different values of $K$. It is the quadratic
function $f(x) = \sum_{k=1}^{10} -20(x_k - 25)^2$.

Figure 4: Illustration of the trajectory of algorithm SL-UCB with a budget
$n = 50$, with a zoom at the beginning of the trajectory to illustrate the
support exploration phase. The levels of gray correspond to the contours of
the function.
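The random best direction baseline described above is simple enough to sketch directly. The code below is an illustration under assumptions: the step length `eps`, the starting point at the origin, the Gaussian sampling of directions and the function name `best_random_direction` are all choices made for the example and are not specified in the text; the test function is assumed to be quadratic in its first 10 coordinates.

```python
import math
import random


def f(u):
    """Assumed form of the test function: quadratic in its first 10 coordinates."""
    return sum(-20.0 * (uk - 25.0) ** 2 for uk in u[:10])


def best_random_direction(f, u0, n, eps, rng):
    """Try one random unit direction per step; keep the move only if f increases."""
    u, values = list(u0), [f(u0)]
    for _ in range(n):
        direction = [rng.gauss(0.0, 1.0) for _ in u]
        norm = math.sqrt(sum(d * d for d in direction))
        candidate = [uk + eps * d / norm for uk, d in zip(u, direction)]
        if f(candidate) > f(u):          # accept only improving moves
            u = candidate
        values.append(f(u))
    return u, values


rng = random.Random(0)
K, n, eps = 200, 100, 1.0                # assumed values; K/n = 2 as in the first row
u_final, values = best_random_direction(f, [0.0] * K, n, eps, rng)
gain = values[-1] - values[0]            # the quantity f(u_n) - f(u_0)
```

Because a random direction in dimension $K$ has only a fraction of order $S/K$ of its squared norm on the relevant coordinates, the accepted moves become less useful as $K$ grows, which is the effect discussed in the comparison with SL-UCB.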



K/n     OGS             SL-UCB          BRD
2       1.875 × 10^5    1.723 × 10^5    2.934 × 10^4
10      1.875 × 10^5    1.657 × 10^5    1.335 × 10^4
100     1.875 × 10^5    1.552 × 10^5    5.675 × 10^3

Figure 5: We report, for different values of $K/n$ and different strategies,
the value of $f(u_n) - f(u_0)$.

The performance of SL-UCB is (slightly) worse than that of the optimal
"oracle" gradient strategy. This is due to the fact that SL-UCB is only given
partial information on the gradient. However, it performs much better than
the random best direction strategy. Note that the larger $K/n$, the larger the
improvement of SL-UCB over the random best direction strategy. This can be
explained by the fact that the larger $K/n$, the less probable it is that the
random direction strategy picks a direction of interest, whereas our algorithm
is designed for efficiently selecting the relevant directions.

8 Each of the 100 pulls corresponds to an access to the full gradient of the
function at a chosen point.

4 Analysis of the SL-UCB algorithm 

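4.1 Definition of a high-probability event $\xi$

Step 0: Bound on the variations of $\hat\theta_t$ around its mean during the
Support Exploration Phase

Note that since $x_{k,t} = \frac{1}{\sqrt{K}}$ or $x_{k,t} = -\frac{1}{\sqrt{K}}$
during the Support Exploration Phase, the estimate $\hat\theta_t$ of $\theta$
during this phase is such that, for any $t_0 \le T$ and any $k$,

$$\begin{aligned}
\hat\theta_{k,t_0} &= \frac{K}{t_0}\sum_{t=1}^{t_0} x_{k,t}\, r_t \\
&= \frac{K}{t_0}\sum_{t=1}^{t_0} x_{k,t} \sum_{k'=1}^{K} x_{k',t}(\theta_{k'} + \eta_{k',t}) \\
&= \frac{K}{t_0}\sum_{t=1}^{t_0} x_{k,t}^2\, \theta_k
 + \frac{K}{t_0}\sum_{t=1}^{t_0}\sum_{k' \neq k} x_{k,t}\, x_{k',t}\, \theta_{k'}
 + \frac{K}{t_0}\sum_{t=1}^{t_0}\sum_{k'=1}^{K} x_{k,t}\, x_{k',t}\, \eta_{k',t} \\
&= \theta_k + \frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k' \neq k} b_{k,k',t}\, \theta_{k'}
 + \frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k'=1}^{K} b_{k,k',t}\, \eta_{k',t},
\end{aligned} \qquad (6)$$

where $b_{k,k',t} = K x_{k,t}\, x_{k',t}$.

Note that since the $x_{k,t}$ are i.i.d. random variables such that
$x_{k,t} = \frac{1}{\sqrt{K}}$ with probability 1/2 and
$x_{k,t} = -\frac{1}{\sqrt{K}}$ with probability 1/2, the
$(b_{k,k',t})_{k' \neq k, t}$ are i.i.d. Rademacher random variables, and
$b_{k,k,t} = 1$.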
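The properties of the coefficients $b_{k,k',t} = K x_{k,t} x_{k',t}$ are easy to check on a draw from the exploring set. The sketch below is illustrative only; the dimension `K = 16`, the number of draws and the helper name `draw_arm` are assumptions made for the example.

```python
import math
import random


def draw_arm(K, rng):
    """Hypothetical helper: uniform draw from K^{-1/2} {-1, +1}^K."""
    return [rng.choice((-1.0, 1.0)) / math.sqrt(K) for _ in range(K)]


rng = random.Random(0)
K = 16
x = draw_arm(K, rng)
b = [[K * x[k] * x[kp] for kp in range(K)] for k in range(K)]

squared_norm = sum(v * v for v in x)              # the arm lies on the unit sphere

# Empirical mean of one off-diagonal coefficient over independent draws.
draws = 2000
mean_b01 = 0.0
for _ in range(draws):
    y = draw_arm(K, rng)
    mean_b01 += K * y[0] * y[1] / draws
```

Every coefficient has absolute value 1, the diagonal ones equal 1, and the off-diagonal ones average out to roughly zero, which is what makes $\hat\theta_{k,t_0}$ an unbiased estimate of $\theta_k$.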

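Step 1: Study of the first term. Let us first study
$\frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k' \neq k} b_{k,k',t}\, \theta_{k'}$.

Note that the $b_{k,k',t}\, \theta_{k'}$ are $(K-1)t_0$ zero-mean independent
random variables and that among them, $\forall k' \in \{1, \ldots, K\}$, $t_0$
of them are bounded by $\theta_{k'}$, i.e. the $(b_{k,k',t}\, \theta_{k'})_t$.
By Hoeffding's inequality, we thus have with probability $1 - \delta$ that
$\big|\frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k' \neq k} b_{k,k',t}\, \theta_{k'}\big|
\le \frac{\|\theta\|_2 \sqrt{2\log(2/\delta)}}{\sqrt{t_0}}$. Now by using a
union bound on all the $k \in \{1, \ldots, K\}$, we have w.p. $1 - \delta$,
$\forall k$,

$$\Big|\frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k' \neq k} b_{k,k',t}\, \theta_{k'}\Big|
\le \frac{\|\theta\|_2 \sqrt{2\log(2K/\delta)}}{\sqrt{t_0}}. \qquad (7)$$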
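The constant in the bound of Step 1 can be traced through Hoeffding's inequality. The worked derivation below is a sketch that assumes the two-sided form of the inequality for independent zero-mean variables $Z_i \in [-c_i, c_i]$; the symbols $Z_i$, $c_i$ and $\epsilon$ are introduced here for illustration only.

```latex
\begin{aligned}
&\mathbb{P}\Big(\Big|\sum_i Z_i\Big| \ge \epsilon\Big)
  \le 2\exp\Big(-\frac{\epsilon^2}{2\sum_i c_i^2}\Big),
  \qquad \sum_i c_i^2 = t_0 \sum_{k' \neq k} \theta_{k'}^2 \le t_0 \|\theta\|_2^2, \\
&2\exp\Big(-\frac{\epsilon^2}{2 t_0 \|\theta\|_2^2}\Big) = \delta
  \iff \epsilon = \|\theta\|_2 \sqrt{2 t_0 \log(2/\delta)}, \\
&\frac{\epsilon}{t_0} = \frac{\|\theta\|_2 \sqrt{2\log(2/\delta)}}{\sqrt{t_0}}.
\end{aligned}
```

Replacing $\delta$ by $\delta/K$ to cover all $K$ coordinates simultaneously turns $\log(2/\delta)$ into $\log(2K/\delta)$.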



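Step 2: Study of the second term. Let us now study
$\frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k'=1}^{K} b_{k,k',t}\, \eta_{k',t}$.

Note that the $(b_{k,k',t}\, \eta_{k',t})_{k',t}$ are $K t_0$ independent
zero-mean random variables, and that among these variables,
$\forall k' \in \{1, \ldots, K\}$, $t_0$ of them are bounded by
$\frac{1}{2}\sigma_{k'}$. By Hoeffding's inequality, we thus have with
probability $1 - \delta$,
$\big|\frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k'=1}^{K} b_{k,k',t}\, \eta_{k',t}\big|
\le \frac{\|\sigma\|_2 \sqrt{2\log(2/\delta)}}{\sqrt{t_0}}$. Thus by a union
bound, with probability $1 - \delta$, $\forall k$,

$$\Big|\frac{1}{t_0}\sum_{t=1}^{t_0}\sum_{k'=1}^{K} b_{k,k',t}\, \eta_{k',t}\Big|
\le \frac{\|\sigma\|_2 \sqrt{2\log(2K/\delta)}}{\sqrt{t_0}}. \qquad (8)$$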



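Step 3: Final bound. Finally, for a given $t_0$, with probability
$1 - 2\delta$, we have by Equations 6, 7 and 8, for all $k$,

$$|\hat\theta_{k,t_0} - \theta_k| \le \frac{(\|\theta\|_2 + \|\sigma\|_2)\sqrt{2\log(2K/\delta)}}{\sqrt{t_0}}. \qquad (9)$$

Step 4: Definition of the event of interest. Now we consider the event $\xi$
such that

$$\xi = \bigcap_{t=1,\ldots,n} \Big\{ \forall k,\ |\hat\theta_{k,t} - \theta_k| \le \frac{b}{\sqrt{t}} \Big\}, \qquad (10)$$

where $b = (\bar\theta_2 + \bar\sigma_2)\sqrt{2\log(2K/\delta)}$.

From Equation 9 and a union bound over time, we deduce that
$\mathbb{P}(\xi) \ge 1 - 2n\delta$.

4.2 Length of the Support Exploration Phase

The Support Exploration Phase ends at the first time $t$ such that (i)
$\max_k |\hat\theta_{k,t}| - \frac{2b}{\sqrt{t}} \ge 0$ and (ii)
$t \ge \frac{\sqrt{n}}{\max_k |\hat\theta_{k,t}| - \frac{b}{\sqrt{t}}}$.

Step 1: A result on the empirical best arm. On the event $\xi$, we know that
for any $t$ and any $k$,
$|\theta_k| - \frac{b}{\sqrt{t}} \le |\hat\theta_{k,t}| \le |\theta_k| + \frac{b}{\sqrt{t}}$.
In particular, for $k^* = \arg\max_k |\theta_k|$ we have

$$|\theta_{k^*}| - \frac{b}{\sqrt{t}} \le \max_k |\hat\theta_{k,t}| \le |\theta_{k^*}| + \frac{b}{\sqrt{t}}. \qquad (11)$$

Step 2: Maximum length of the Support Exploration Phase. If
$|\theta_{k^*}| - \frac{3b}{\sqrt{t}} \ge 0$ then by Equation 11 the first
criterion (i) is verified on $\xi$. If
$t \ge \frac{\sqrt{n}}{|\theta_{k^*}| - \frac{2b}{\sqrt{t}}}$ then by Equation
11 the second criterion (ii) is verified on $\xi$.

Note that both those conditions are thus verified if
$t \ge \max\big(\frac{9b^2}{|\theta_{k^*}|^2}, \frac{3\sqrt{n}}{|\theta_{k^*}|}\big)$.
The Support Exploration Phase thus stops before this moment. Note that as the
budget of the algorithm is $n$, we have on $\xi$ that
$T \le \max\big(\frac{9b^2}{|\theta_{k^*}|^2}, \frac{3\sqrt{n}}{|\theta_{k^*}|}, n\big)
\le \frac{9\sqrt{S}\, b^2}{\|\theta\|_2}\sqrt{n}$. We write
$T_{\max} = \frac{9\sqrt{S}\, b^2}{\|\theta\|_2}\sqrt{n}$.

Step 3: Minimum length of the Support Exploration Phase. If the first
criterion (i) is verified then on $\xi$, by Equation 11,
$|\theta_{k^*}| - \frac{b}{\sqrt{t}} \ge 0$. If the second criterion (ii) is
verified then on $\xi$, by Equation 11, we have
$t \ge \frac{\sqrt{n}}{|\theta_{k^*}|}$.

Combining those two results, we have on the event $\xi$ that
$T \ge \max\big(\frac{b^2}{|\theta_{k^*}|^2}, \frac{\sqrt{n}}{|\theta_{k^*}|}\big)
\ge \frac{\sqrt{n}}{\|\theta\|_2}$. We write
$T_{\min} = \frac{\sqrt{n}}{\|\theta\|_2}$.

4.3 Description of the set $\mathcal{A}$

The set $\mathcal{A}$ is defined as
$\mathcal{A} = \big\{k : |\hat\theta_{k,T}| \ge \frac{2b}{\sqrt{T}}\big\}$.

Step 1: Arms that are in $\mathcal{A}$. Let us consider an arm $k$ such that
$|\theta_k| \ge \frac{3b\sqrt{\|\theta\|_2}}{n^{1/4}}$. Note that
$T \ge T_{\min} = \frac{\sqrt{n}}{\|\theta\|_2}$ on $\xi$. We thus know that on
$\xi$

$$|\hat\theta_{k,T}| \ge |\theta_k| - \frac{b}{\sqrt{T}}
\ge \frac{3b\sqrt{\|\theta\|_2}}{n^{1/4}} - \frac{b\sqrt{\|\theta\|_2}}{n^{1/4}}
= \frac{2b\sqrt{\|\theta\|_2}}{n^{1/4}} \ge \frac{2b}{\sqrt{T}}.$$

This means that $k \in \mathcal{A}$ on $\xi$. We thus know that
$|\theta_k| \ge \frac{3b\sqrt{\|\theta\|_2}}{n^{1/4}}$ implies on $\xi$ that
$k \in \mathcal{A}$.

Step 2: Arms that are not in $\mathcal{A}$. Now let us consider an arm $k$
such that $|\theta_k| < \frac{b}{2\sqrt{T}}$. Then on $\xi$, we know that

$$|\hat\theta_{k,T}| \le |\theta_k| + \frac{b}{\sqrt{T}}
< \frac{b}{2\sqrt{T}} + \frac{b}{\sqrt{T}} = \frac{3b}{2\sqrt{T}} < \frac{2b}{\sqrt{T}}.$$

This means that $k \in \mathcal{A}^c$ on $\xi$. This implies that on $\xi$, if
$|\theta_k| = 0$, then $k \in \mathcal{A}^c$.

Step 3: Summary. Finally, we know that $\mathcal{A}$ contains all the $k$ such
that $|\theta_k| \ge \frac{3b\sqrt{\|\theta\|_2}}{n^{1/4}}$, and that it
contains only coordinates with strictly positive $|\theta_k|$, i.e. at most
$S$ elements since $\theta$ is $S$-sparse. We write
$\mathcal{A}_{\min} = \big\{k : |\theta_k| \ge \frac{3b\sqrt{\|\theta\|_2}}{n^{1/4}}\big\}$.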



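4.4 Comparison of the best element on $\mathcal{A}$ and on $B_K$

Now let us compare
$\max_{x_t \in \mathrm{Vec}(\mathcal{A}) \cap B_K} \langle \theta, x_t \rangle$ and
$\max_{x_t \in B_K} \langle \theta, x_t \rangle$.

At first, note that $\max_{x_t \in B_K} \langle \theta, x_t \rangle = \|\theta\|_2$
and that
$\max_{x_t \in \mathrm{Vec}(\mathcal{A}) \cap B_K} \langle \theta, x_t \rangle
= \|\theta_{\mathcal{A}}\|_2 = \sqrt{\sum_{k=1}^{K} \theta_k^2\, \mathbb{1}\{k \in \mathcal{A}\}}$,
where $\theta_{\mathcal{A},k} = \theta_k$ if $k \in \mathcal{A}$ and
$\theta_{\mathcal{A},k} = 0$ otherwise. This means that

$$\max_{x_t \in B_K} \langle \theta, x_t \rangle
- \max_{x_t \in \mathrm{Vec}(\mathcal{A}) \cap B_K} \langle \theta, x_t \rangle
= \|\theta\|_2 - \|\theta \mathbb{1}\{k \in \mathcal{A}\}\|_2
= \frac{\|\theta\|_2^2 - \|\theta \mathbb{1}\{k \in \mathcal{A}\}\|_2^2}{\|\theta\|_2 + \|\theta \mathbb{1}\{k \in \mathcal{A}\}\|_2}. \qquad (12)$$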
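The identity in Equation (12) and its consequence for the loss incurred by restricting to the active set can be checked numerically. The sketch below uses a hypothetical parameter `theta` and a hypothetical active set `active`; it also evaluates the upper bound $\|\theta \mathbb{1}\{k \in \mathcal{A}^c\}\|_2^2 / \|\theta\|_2$, which follows from the identity because the numerator equals the squared norm of the discarded coordinates and the denominator is at least $\|\theta\|_2$.

```python
import math

theta = [3.0, 0.0, 0.4, 0.0, -0.3]     # hypothetical sparse parameter
active = {0}                            # hypothetical active set: one large coordinate

norm_full = math.sqrt(sum(t * t for t in theta))
norm_on_A = math.sqrt(sum(t * t for k, t in enumerate(theta) if k in active))
norm_off_A_sq = sum(t * t for k, t in enumerate(theta) if k not in active)

gap = norm_full - norm_on_A                                  # left-hand side of (12)
ratio_form = (norm_full ** 2 - norm_on_A ** 2) / (norm_full + norm_on_A)
upper = norm_off_A_sq / norm_full                            # looser, simpler bound
```

The gap is quadratic in the size of the discarded coordinates, so missing a few small components of $\theta$ costs much less per round than their magnitude would suggest.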



4.5 Expression of the regret of the algorithm 

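Assume that we run the algorithm
CB$_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, T)$ at time $T$, where
$\mathcal{A} \subset \mathrm{Supp}(\theta)$, with a budget of $n_1 = n - T$
samples. In the paper (Dani et al., 2008), they prove that on an event
$\xi_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, T)$ of probability
$1 - \delta$ the regret of algorithm CB$_2$ is bounded by

$$R_n(\mathrm{Alg}_{CB_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, T)})
\le 64\,|\mathcal{A}|\,(\|\theta\|_2 + \|\sigma\|_2)(\log(n^2/\delta))^2 \sqrt{n_1}.$$

Note that since $\mathcal{A} \subset \mathrm{Supp}(\theta)$, we have
$\xi_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, T) \subset
\xi_2(\mathrm{Vec}(\mathrm{Supp}(\theta)) \cap B_K, \delta, T)$ (see the paper
(Dani et al., 2008) for more details on the event $\xi_2$). We thus know that,
conditionally on $T$, with probability $1 - \delta$, the regret is bounded for
any $\mathcal{A} \subset \mathrm{Supp}(\theta)$ as

$$R_n(\mathrm{Alg}_{CB_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, T)})
\le 64\, S\, (\|\theta\|_2 + \|\sigma\|_2)(\log(n^2/\delta))^2 \sqrt{n_1}.$$

By a union bound on all possible values for $T$ (i.e. from 1 to $n$), we
obtain that on an event $\xi_2$ whose probability is larger than $1 - \delta$,

$$R_n(\mathrm{Alg}_{CB_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, T)})
\le 64\, S\, (\|\theta\|_2 + \|\sigma\|_2)(\log(n^3/\delta))^2 \sqrt{n}.$$

We thus have on $\xi \cap \xi_2$, i.e. on an event with probability larger than
$1 - 2\delta$, that

$$R_n(\mathrm{Alg}_{SL\text{-}UCB}, \delta) \le 2 T_{\max} \|\theta\|_2
+ \max_t R_n(\mathrm{Alg}_{CB_2(\mathrm{Vec}(\mathcal{A}) \cap B_K, \delta, t)})
+ n\Big(\max_{x \in B_K} \langle x, \theta \rangle
- \max_{x \in B_K \cap \mathrm{Vec}(\mathcal{A}_{\min})} \langle x, \theta \rangle\Big).$$

By using this Equation, the maximal length of the support exploration phase
$T_{\max}$ deduced in Step 2 of Subsection 4.2, and Equation 12, we obtain on
$\xi$ that

$$\begin{aligned}
R_n &\le 64\, S\, (\|\theta\|_2 + \|\sigma\|_2)(\log(n^2/\delta))^2 \sqrt{n} \\
&\le 118\,(\bar\theta_2 + \bar\sigma_2)^2 \log(2K/\delta)\, S \sqrt{n},
\end{aligned}$$

by using $b = (\bar\theta_2 + \bar\sigma_2)\sqrt{2\log(2K/\delta)}$ for the last
step.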
Conclusion 



The SL-UCB algorithm has been designed using ideas from Compressed Sensing and
Bandit Theory. Compressed Sensing is used in the support exploration phase, in
order to select the support of the parameter. A linear bandit algorithm is
then applied to the low-dimensional subspace defined in the first phase. We
derived a regret bound of order $O(S\sqrt{n})$. Note that the bound scales
with the sparsity $S$ of the unknown parameter $\theta$ instead of the
dimension $K$ of the parameter (as is usually the case in linear bandits). We
then provided an example of application for this setting, the optimization of
a function in high dimension. Possible further research directions include:

• The case when the support of $\theta$ changes with time, for which it would
be important to define assumptions under which sub-linear regret is
achievable. One idea would be to use techniques developed for adversarial
bandits (see (Abernethy et al., 2008; Bartlett et al., 2008; Cesa-Bianchi and
Lugosi, 2009; Koolen et al., 2010; Audibert et al., 2011), but also (Flaxman
et al., 2005) for a more gradient-specific modeling) or also from
restless/switching bandits (see e.g. (Whittle, 1988; Nino-Mora, 2001; Slivkins
and Upfal, 2008; A. Garivier, 2011) and many others). This would be
particularly interesting to model gradient ascent for e.g. a convex function
where the support of the gradient is not constant.



• Designing an improved analysis (or algorithm) in order to achieve a regret
of order $O(\sqrt{Sn})$, which is the lower bound for the problem of linear
bandits in a space of dimension $S$. Note that when an upper bound $S'$ on the
sparsity is available, it seems possible to obtain such a regret by replacing
condition (ii) in the algorithm by
$t < \frac{\sqrt{n}}{\big\|\big(\hat\theta_{t,k}\, \mathbb{1}\{\hat\theta_{t,k} \ge \frac{b}{\sqrt{t}}\}\big)_k\big\|_2}$
and using for the Exploitation phase the algorithm in (Rusmevichientong and
Tsitsiklis, 2008). The regret of such an algorithm would be in
$O(\sqrt{S'n})$. But it is not clear whether it is possible to obtain such a
result when no upper bound on $S$ is available (as is the case for SL-UCB).